Abstract. During the last twenty years there have been considerable 
methodological developments in the design and analysis of Phase 1, 
Phase 2 and Phase 1/2 dose-finding studies. Many of these develop- 
ments are related to the continual reassessment method (CRM), first 
introduced by O'Quigley, Pepe and Fisher (1990). CRM models have 
proven themselves to be of practical use and, in this discussion, we in- 
vestigate the basic approach, some connections to other methods, some 
generalizations, as well as further applications of the model. We obtain 
some new results which can provide guidance in practice. 

Key words and phrases: Bayesian methods, clinical trial, continual re- 
1. INTRODUCTION 

The continual reassessment method (CRM) was 
introduced by O'Quigley, Pepe and Fisher (1990) as 
a design with which to carry out and analyze dose- 
finding studies in oncology. The purpose of these 
studies, usually referred to as Phase 1 trials of a 
new therapeutic agent, is to estimate the maximum 
tolerated dose (MTD) to be used in Phase 2 and 
Phase 3 trials. O'Quigley, Pepe and Fisher (1990) 
pointed out that standard methods in use then, and 
still in use now, fail to address the basic ethical 
requirements of experimentation with human sub- 
jects. Given the unknown or poorly understood re- 
lationship between dose and the probability of unde- 
sirable side effects (toxicity), it is inevitable, during 
experimentation, that some patients will be treated 
at too toxic doses and some patients will be treated 
at doses too low to have any real chance of procur- 
ing benefit. Aside from being inefficient, the case 
against the standard designs is that more patients 
than necessary are treated in this way, either at too 
toxic a dose or, more usually, at too low a dose to 
provide therapeutic benefit. 

The rationale of the CRM is to concentrate as 
many patients as we can on doses at, or close to, 
the MTD. Doing so can provide an efficient esti- 
mate of the MTD while maximizing the number of 
patients in the study treated at doses with poten- 
tial therapeutic benefit but without undue risk of 
toxicity. A drawback of concentrating patients to 
a small number of dose levels, at and around the 
MTD, is that the overall dose-toxicity curve itself 
may be difficult to estimate. In practice, this tends 
not to be a serious drawback, since estimation of 
the entire dose-toxicity curve is rarely the goal of a 
dose-finding clinical trial. 

Phase 1 trials evaluating the toxicity of single agents 
are becoming less common, giving way to more com- 
plex studies involving multiple agents at various doses, 
heterogeneous groups of patients, and evaluations 
of both toxicity and efficacy. The standard methods 
are ill-equipped to handle these more complex sit- 
uations, and here, we will discuss developments of 
the CRM and related methods for tackling various 
problems which arise in the context of dose finding. 
Whereas the standard method fails to perform adequately even in
the simplest situation, model-based designs, while offering greatly
improved performance for the simplest case, also allow us to take
on board the more involved situations that arise in practice
(Braun, 2002; Faries, 2004; Goodman, Zahurak and Piantadosi, 1995;
Legedza and Ibrahim, 2002; Mahmood, 2001; O'Quigley, 2002a;
O'Quigley and Paoletti, 2003; O'Quigley and Reiner, 1998;
O'Quigley, Shen and Gamst, 1999; Piantadosi and Liu, 1996).

We begin with the definitions and notation used 
in Phase 1 trials and an overview of the CRM as 
originally proposed by O'Quigley, Pepe and Fisher 
(1990). The next two sections outline Bayesian and 
likelihood-based inference for the CRM, providing 
results for the small-sample and large-sample prop- 
erties of the method. Section 5 gives extensions of 
the method and discusses modifications of the basic 
design. Section 6 presents related designs, again for
the case of a single outcome, whereas Section 7 considers
two outcomes, one positive and one negative, and describes
the goal of locating the most successful dose (MSD). The
article concludes with a discussion of future directions in
the study of model-based methods for dose-finding studies.

1.1 Doses, DLT, MTD and the MSD 

Traditional thinking in the area of cytotoxic anti- 
cancer treatments is to give as strong a treatment as 
we can without incurring too much toxicity. For the 
great majority of new cancer treatments — recent ad- 
vances in immunotherapy being possible exceptions — 
we consider that increases in dose correspond to in- 
creases in both the number of patients who will ex- 
perience toxic side effects as well as the numbers 
who may benefit from treatment. If we observe a 
complete absence of toxic side effects, then we would 
not anticipate observing any therapeutic effect, ei- 
ther for those patients in the study or for future 
patients. The Phase I trial then has for its goal the 
determination of some dose having an "acceptable" 
rate of toxicity. While it is true that the essential 
goal of the study is to improve treatment for future 
patients, ethical concerns dictate that we give the 
best possible treatment to the patients participating 
in the Phase I study itself. The highest dose level at 
which patients can be treated and where the rate of
toxicity is deemed to be still acceptable is known as
the MTD (maximal tolerated dose).

On an individual level we can imagine being able 
to increase the dose without encountering the toxic 
effect of interest. At some threshold the individual 
will suffer a toxicity. An assumed model is the fol- 
lowing: at this threshold the individual suffers a tox- 
icity and, for all higher doses, the individual would 
also have encountered a toxicity. Such a model is rea- 
sonable for most situations and widely assumed. It 
remains nonetheless a model and might be brought 
under scrutiny in particular cases. The model stipu- 
lates that for all levels below the threshold, the indi- 
vidual would not suffer any toxicity and we call the 
threshold itself the individual's own maximum tol- 
erated dose (MTD). A dose-limiting toxicity (DLT) 
curve for the individual would be a (0, 1) step function,
the value 0 indicating no toxicity and the value
1 a toxicity. Thus, in the case of an individual, the
(0, 1) step function for the DLT coincides with that 
for the MTD. 

Any population of interest can be viewed as be- 
ing composed of individuals each having their own 
particular MTD. Corresponding to each individual 
MTD we have a (0, 1) step function for the individual's
DLT. Over some set or population of individuals, the
average of the DLT curves at any dose equates to the
probability of toxicity at that same dose. For a population
we fix some percentile so that $100 \times \theta\%$, say,
have their own threshold at or below this level. The term
MTD is often used somewhat loosely and is not always well
defined. The more precise definition given in terms of a
percentile involves $\theta$. Different values of $\theta$
would correspond to different definitions of the MTD. The
values 0.2, 0.25 and 0.33 are quite common in practice.

When information on efficacy, possibly through 
surrogate measures or otherwise through some mea- 
sure of response, is available in a timely way, then 
it makes sense to make use of such information. In 
the HIV setting, there have been attempts to si- 
multaneously address the problems of both toxicity 
and efficacy. The goal then becomes not one of find- 
ing the maximum tolerated dose but, rather, one of 
finding the MSD (most successful dose), that is, that 
dose where the probability of treatment failure, be 
it due to excessive toxicity or to insufficient evidence 
of treatment efficacy, is a minimum. The CRM can 
be readily adapted to address these kinds of ques- 
tions (O'Quigley, Hughes and Fenton, 2001; Zohar 
and O'Quigley, 2006a). 

1.2 Notation 

We assume that we have available $k$ doses, $d_1, \ldots, d_k$,
possibly multidimensional and ordered in terms of the
probabilities, $R(d_i)$, for toxicity at each of the levels,
that is, $R(d_i) < R(d_j)$ whenever $i < j$. The MTD is denoted
$d_0$ and is taken to be one of the values in the set
$\{d_1, \ldots, d_k\}$. It is the dose that has an associated
probability of toxicity, $R(d_0)$, as close as we can get to
some target "acceptable" toxicity rate $\theta$. Specifically
we define $d_0 \in \{d_1, \ldots, d_k\}$ such that

$$|R(d_0) - \theta| < |R(d_\ell) - \theta|, \quad \ell = 1, \ldots, k;\ d_\ell \ne d_0. \tag{1}$$

The binary indicator $Y_j$ takes the value 1 in the
case of a toxic response for the $j$th entered subject
($j = 1, \ldots, n$) and 0 otherwise. The dose for the $j$th
entered subject, $X_j$, is viewed as random, taking values
$x_j \in \{d_1, \ldots, d_k\}$; $j = 1, \ldots, n$. Thus we can write

$$\Pr(Y_j = 1 \mid X_j = x_j) = R(x_j).$$

Little is known about $R(\cdot)$ and, given the $n$ observations,
the main goal is to identify $d_0$. Estimation of
all or part of $R(d_\ell)$, $\ell = 1, \ldots, k$, is only of indirect
interest in as much as it may help provide information
on $d_0$.
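
The definition of the MTD as the available dose whose toxicity probability is closest to the target can be sketched in a few lines of Python. This is only an illustration: the function name `mtd_index` is hypothetical, the probabilities are the true values quoted in the illustration of Section 3.3, and ties (which the strict inequality in the definition rules out) are resolved here in favor of the lower dose.

```python
def mtd_index(tox_probs, theta):
    """Return the 0-based index of the dose whose toxicity probability
    is closest to the target rate theta (lowest index wins a tie)."""
    distances = [abs(r - theta) for r in tox_probs]
    return distances.index(min(distances))


# True toxicity probabilities R(d_1), ..., R(d_6) from Section 3.3.
R = [0.03, 0.22, 0.45, 0.60, 0.80, 0.95]
theta = 0.20  # target "acceptable" toxicity rate

d0 = mtd_index(R, theta)
print(f"MTD is dose level {d0 + 1}, where R = {R[d0]}")
```

In a real trial $R(\cdot)$ is of course unknown; the function only makes explicit what is being estimated.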

There is an extensive literature on problems sim- 
ilar to that just described. The solutions to these 
problems, however, are mostly inapplicable in view 
of ethical constraints involved in treating human 
subjects. The patients included in the Phase I de- 
sign must, themselves, be treated "optimally," the 
notion optimal now implying for these patients a re- 
quirement to treat at the best dose level, taken to 
be the one as close as we can get to $d_0$. We then
have two statistical goals: (1) estimate $d_0$ consistently
and efficiently and (2) during the course of the study,
concentrate as many experiments as possible around $d_0$.
Specifically, we aim to treat the $j$th included patient at
the same level we would have estimated as being $d_0$ had
the study ended after the inclusion of $j - 1$ patients.

2. CONTINUAL REASSESSMENT METHOD 

The continual reassessment method (CRM), proposed as a
statistical design to meet the requirements of the type of
studies described above, was introduced by O'Quigley, Pepe
and Fisher (1990). Many developments and innovations have
followed, the basic method and variants having found a
number of other potential applications. Here, we reconsider
the original problem, expressed in statistical terms, since
it is this problem that forged the method. In this article
we consider the main theoretical ideas and do not dwell on
precise applications except for illustrative purposes.

The method begins with a parameterized working
model for $R(x_j)$, denoted by $\psi(x_j, a)$, for some
one-parameter model $\psi(x_j, a)$ and $a$ defined on the set
$\mathcal{A}$. For every $a$, $\psi(x, a)$ should be monotone
increasing in $x$ and, for any $x$, $\psi(x, a)$ should be monotone
in $a$. For every $d_i$ there exists some $a_i \in \mathcal{A}$ such that
$R(d_i) = \psi(d_i, a_i)$, that is, the one-parameter model
is rich enough, at each dose, to exactly reproduce
the true probability of toxicity at that dose. There
are many choices for $\psi(x, a)$, including the simple
Lehmann type shift model:

$$\log\{-\log \psi(d_i, a)\} = \log\{-\log \alpha_i\} + a, \quad i = 1, \ldots, k, \tag{2}$$

where $0 < \alpha_1 < \cdots < \alpha_k < 1$ and $-\infty < a < \infty$;
equivalently, $\psi(d_i, a) = \alpha_i^{\exp(a)}$. This model has
shown itself to work well in practice. The parameterization
allows for the support of the parameter $a$ to be on the
whole real line and priors such as the normal or the
logistic, having heavier tails, have been used here. The
simple power model of O'Quigley, Pepe and Fisher (1990)
required that support for the parameter $a$ be restricted to
the positive real line.
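
As a small sketch of the shift model (2), the following Python fragment evaluates the working model in the equivalent closed form $\psi(d_i, a) = \alpha_i^{\exp(a)}$. The function name `psi` and the grid of values for $a$ are illustrative choices; the $\alpha_i$ are those used in the illustration of Section 3.3.

```python
import math


def psi(alpha_i, a):
    """Working model (2): log(-log psi) = log(-log alpha_i) + a,
    which is the same as psi = alpha_i ** exp(a)."""
    return alpha_i ** math.exp(a)


# Working-model constants alpha_1 < ... < alpha_6 (values from Section 3.3).
alphas = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]

# For each a the curve is increasing in dose; for each dose it is
# monotone (here decreasing) in a. At a = 0 it returns the alpha_i.
for a in (-0.5, 0.0, 0.5):
    print(a, [round(psi(al, a), 3) for al in alphas])
```

The single parameter $a$ moves the whole set of working probabilities up or down together, which is what allows one observation at any dose to update the estimate at every dose.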

O'Quigley, Pepe and Fisher (1990) suggested that
the $\alpha_i$, $i = 1, \ldots, k$, be chosen to reflect a priori
assumptions about the toxicity probabilities associated
with each dose. Lee and Cheung (2009) provided a systematic
approach to choosing the initial values for the $\alpha_i$,
$i = 1, \ldots, k$. Yin and Yuan (2009) used Bayesian model
averaging to combine estimates from different sets of
initial guesses at the $\alpha_i$, $i = 1, \ldots, k$. It
should again be noted that the
working model is not anticipated to represent the 
entire dose-toxicity curve. It suffices that the pa- 
rameterized working model be flexible enough to al- 
low for estimation of the dose-toxicity relationship 
at and close to the MTD. This point will be de- 
veloped more fully in Section 4, which summarizes 
the small- and large-sample properties of the CRM. 
Cheung and Chappell (2002) investigated the oper- 
ational sensitivity to different model choices. 

Once a model has been chosen and we have data
in the form of the set $\Omega_j = \{y_1, x_1, \ldots, y_j, x_j\}$, the
outcomes of the first $j$ experiments, we obtain estimates
$\hat R(d_i)$ ($i = 1, \ldots, k$) of the true unknown probabilities
$R(d_i)$ ($i = 1, \ldots, k$) at the $k$ dose levels (see
below). The target dose level is that level having associated
with it a probability of toxicity as close as
we can get to $\theta$. The dose or dose level $x_j$ assigned
to the $j$th included patient is such that

$$|\hat R(x_j) - \theta| < |\hat R(d_\ell) - \theta|, \quad \ell = 1, \ldots, k;\ d_\ell \ne x_j. \tag{3}$$

This equation should be compared to (1). It translates
the idea that the overall goal of the study is
also the goal for each included patient. The CRM
is then an iterative sequential design, the level chosen
for the $(n + 1)$th patient, who is hypothetical,
being also our estimate of $d_0$. After having included
$j$ subjects, we can calculate a posterior distribution
for $a$ which we denote by $f(a, \Omega_j)$. We then induce a
posterior distribution for $\psi(d_i, a)$, $i = 1, \ldots, k$, from
which we can obtain summary estimates of the toxicity
probabilities at each level so that

$$\hat R(d_i) = \int_{a \in \mathcal{A}} \psi(d_i, a) f(a, \Omega_j)\, da, \quad i = 1, \ldots, k. \tag{4}$$

Using (3) we can now decide which dose level to
allocate to the $(j + 1)$th patient.
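
The estimation and allocation steps (4) and (3) can be sketched numerically as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: it uses model (2), a normal prior on $a$ with mean zero and a hypothetical standard deviation `prior_sd`, and a simple grid approximation to the integral. The function names and the grid limits are our own choices; the $\alpha_i$ and the nine outcomes are those of the illustration in Section 3.3.

```python
import math

# Working-model constants (values from Section 3.3).
ALPHAS = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]


def posterior_tox_means(levels, ys, prior_sd=1.0, lo=-5.0, hi=5.0, n_grid=2001):
    """Grid approximation to (4) under model (2) with a N(0, prior_sd^2) prior.

    levels: 0-based dose indices given to patients 1..j
    ys:     matching 0/1 toxicity indicators
    """
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]
    log_post = []
    for a in grid:
        lp = -0.5 * (a / prior_sd) ** 2  # log prior, up to a constant
        for lev, y in zip(levels, ys):
            log_p = math.exp(a) * math.log(ALPHAS[lev])  # log psi(d, a)
            lp += log_p if y == 1 else math.log1p(-math.exp(log_p))
        log_post.append(lp)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [
        sum(w * alpha ** math.exp(a) for w, a in zip(weights, grid)) / total
        for alpha in ALPHAS
    ]


def next_level(r_hat, theta):
    """Allocation rule (3): 0-based level with estimate closest to theta."""
    distances = [abs(r - theta) for r in r_hat]
    return distances.index(min(distances))


# First nine patients of Section 3.3: three each at levels 1, 2 and 3,
# with two toxicities among the three patients at level 3.
levels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 0]
r_hat = posterior_tox_means(levels, ys)
print([round(r, 3) for r in r_hat], "next level:", next_level(r_hat, 0.20) + 1)
```

Calling `posterior_tox_means([], [])` returns the prior means, so the same two functions also give the starting dose implied by the prior.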

In the original version of the CRM, O'Quigley,
Pepe and Fisher (1990) used an alternative estimate
$\tilde R(d_i) = \psi(d_i, \mu)$, $i = 1, \ldots, k$, where
$\mu = \int_{a \in \mathcal{A}} a f(a, \Omega_j)\, da$. This was done
primarily to reduce the amount of calculation required, a
consideration of less importance today. O'Quigley, Pepe
and Fisher (1990) completed the specification of the 
dose allocation algorithm by specifying a starting 
dose based on a prior specification of the dose level 
with probability closest to the target. 

3. BAYESIAN AND LIKELIHOOD INFERENCE 

In order to base inference only on the likelihood 
it is necessary to have the likelihood nonmonotone 
so that the estimates are not on the boundary of 
the parameter space. This is accomplished by having 
some initial escalation scheme until the data achieve 
at least one toxicity and one nontoxicity. We can 
regard the data obtained via this initial escalation 
scheme as, in some sense, empirical and use them as 
a data-based prior to the second part of the study. 
Thus, the Bayesian and the likelihood-only approaches can
both be put under a Bayesian heading. We use this in the
following to study different Bayesian approaches to
inference.

3.1 Likelihood-Based Dose Allocations 

After the inclusion of the first $j$ patients, the logarithm
of the likelihood can be written as

$$\mathcal{L}_j(a) = \sum_{\ell=1}^{j} y_\ell \log \psi(x_\ell, a) + \sum_{\ell=1}^{j} (1 - y_\ell) \log(1 - \psi(x_\ell, a)), \tag{5}$$

where any terms not involving the parameter $a$ have
been equated to zero. We suppose that $\mathcal{L}_j(a)$ is
maximized at $a = \hat a_j$. Once we have calculated $\hat a_j$ we
can next obtain an estimate of the probability of
toxicity at each dose level $d_i$ via $\hat R(d_i) = \psi(d_i, \hat a_j)$
($i = 1, \ldots, k$). On the basis of this formula the dose
to be given to the $(j + 1)$th patient, $x_{j+1}$, is determined.
Once we have estimated $a$ we can also calculate an
approximate $100(1 - \alpha)\%$ confidence interval
for $\psi(x_{j+1}, \hat a_j)$ as $(\psi_j^-, \psi_j^+)$ where

$$\psi_j^- = \psi\{x_{j+1}, (\hat a_j + z_{1-\alpha/2}\, v(\hat a_j)^{1/2})\},$$

$$\psi_j^+ = \psi\{x_{j+1}, (\hat a_j - z_{1-\alpha/2}\, v(\hat a_j)^{1/2})\},$$

where $z_\alpha$ is the $\alpha$th percentile of a standard normal
distribution and $v(\hat a_j)$ is an estimate of the variance
of $\hat a_j$. For the model of (2) this turns out to be
particularly simple and we can write

$$v^{-1}(\hat a_j) = \sum_{\ell \le j,\ y_\ell = 0} \psi(x_\ell, \hat a_j)(\log \alpha_\ell)^2 / (1 - \psi(x_\ell, \hat a_j))^2.$$

Although based on a misspecified model these intervals
turn out to be quite accurate, even for sample
sizes as small as 16, and thus helpful in practice
(O'Quigley, 1992).
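
The following Python sketch puts the likelihood steps of this subsection together: maximize the log-likelihood, read off the toxicity estimates, and form the approximate confidence interval. It is an illustration under assumptions. The model is written in the power form $\psi = \alpha_i^{b}$ with $b = \exp(a)$, which is model (2) after reparameterization and the form under which the variance expression quoted above equals the inverse observed information. The function names, the bisection bounds and the clamp of the upper limit at 1 are our own choices. The data are the first nine outcomes reported in the illustration of Section 3.3, for which the maximizer should come out close to the value 0.715 quoted there.

```python
import math
from statistics import NormalDist

# Working-model constants (values from Section 3.3).
ALPHAS = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]


def psi(level, b):
    """Power form of model (2): psi = alpha_level ** b, with b = exp(a)."""
    return ALPHAS[level] ** b


def score(b, levels, ys):
    """Derivative of the log-likelihood (5) with respect to b."""
    total = 0.0
    for lev, y in zip(levels, ys):
        log_alpha = math.log(ALPHAS[lev])
        p = psi(lev, b)
        total += log_alpha if y == 1 else -p * log_alpha / (1.0 - p)
    return total


def mle(levels, ys, lo=1e-6, hi=50.0, tol=1e-10):
    """Bisection on the score, which is decreasing in b."""
    if len(set(ys)) < 2:
        raise ValueError("need at least one toxicity and one non-toxicity")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, levels, ys) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def variance(b_hat, levels, ys):
    """Inverse of the sum over non-toxic outcomes given in the text."""
    info = sum(
        psi(lev, b_hat) * math.log(ALPHAS[lev]) ** 2 / (1.0 - psi(lev, b_hat)) ** 2
        for lev, y in zip(levels, ys)
        if y == 0
    )
    return 1.0 / info


def conf_int(level, b_hat, v, conf=0.90):
    """Approximate interval (psi^-, psi^+) for the toxicity probability."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    half = z * math.sqrt(v)
    # psi decreases in b, so the larger exponent gives the lower limit.
    return psi(level, b_hat + half), min(1.0, psi(level, b_hat - half))


# Three patients each at levels 1, 2, 3; two toxicities at level 3.
levels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 0]
b_hat = mle(levels, ys)
print("b_hat:", round(b_hat, 3))
print("estimates:", [round(psi(i, b_hat), 3) for i in range(len(ALPHAS))])
print("90% interval at level 2:", conf_int(1, b_hat, variance(b_hat, levels, ys)))
```

The explicit check for heterogeneous responses mirrors the requirement stated at the start of Section 3: without at least one toxicity and one non-toxicity the score has no root in the interior of the parameter space.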

3.2 Prior Information on the Parameter a 

There are three distinct approaches which can be 
used in order to establish the prior information. These 
are: (1) postulate some numerically tractable and 
sufficiently flexible density g(a), (2) introduce a 
pseudo-data prior which indirectly will specify g(a), 
and (3) use some initial escalation scheme in a two- 
stage design until the first toxicity is observed. Let 
us consider these three approaches more closely. 
A gamma prior for g(a). For the Lehmann shift
model, on a logarithmic scale, given that $\mathcal{A} = (0, \infty)$,
O'Quigley, Pepe and Fisher (1990) suggested, as a
natural candidate,

$$g(a) = \lambda^{c} a^{c-1} \exp\{-(\lambda a)\} / \Gamma(c), \qquad \Gamma(c) = \int_0^{\infty} \exp(-u)\, u^{c-1}\, du,$$

the gamma density with scale parameter $\lambda$ and shape
parameter $c$. The necessary steps in fitting a gamma
prior on the basis of the upper and lower points
of our prior confidence region have been described
by Martz and Waller (1982). For a relatively simple
set-up involving no more than six doses and using
a coding for dose (not the actual dose itself),
O'Quigley, Pepe and Fisher (1990) suggested that
the simple exponential prior for $a$ (a special case of
the gamma prior with $c$ and $\lambda$ both equal to 1)
would be satisfactory. Some authors have appealed
to this simple exponential prior in different contexts,
or more involved set-ups, and the resulting behavior
of the method can be alarming (Moller, 1995).

Pseudo-data prior. In the place of a prior expressed
as a specific parametric density function, pseudo-data
priors create observations that are weighted in
accordance with our degree of belief in their plausibility.
Using pseudo-data $y_\ell^*$, $\ell = 1, \ldots, K$, the prior
$g(a)$ is defined from

$$g(a) = \exp\Big[\sum_{\ell=1}^{K} y_\ell^* \log \psi(x_\ell, a) + \sum_{\ell=1}^{K} (1 - y_\ell^*) \log(1 - \psi(x_\ell, a))\Big]. \tag{6}$$

The prior "data" can be combined with the observed
data. In consequence standard and widely available
programs such as SAS or SPSS may be used directly
to calculate the posterior mode without the need for
numerical integration. The pseudo-data prior can be
used to establish our best prior guesses which will
be mirrored by the estimates of $a$ obtained from fitting
the pseudo-data alone. The imprecision which
we wish to associate with this can be governed by
a weighting coefficient $w_j$ where $0 < w_j < 1$. This
coefficient can be independent of $j$ and we would
usually require that $w_j \le w_{j-1}$. The posterior density
is then

$$f(a, \Omega_j) = A_j^{-1} \exp\{w_j \log g(a) + (1 - w_j)\mathcal{L}_j(a)\}, \tag{7}$$

where $A_j = \int_{a \in \mathcal{A}} \exp\{w_j \log g(a) + (1 - w_j)\mathcal{L}_j(a)\}\, da$.
The added generality of allowing the dependence of
the weights on $j$ would rarely be needed and, in
most practical situations, it suffices to take $w$ as a
constant small enough so that the prior has no more
impact than deemed necessary.
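
A short sketch of how the weighted posterior (7) can be used to obtain a posterior mode without numerical integration is given below. Everything specific in it is an assumption made for illustration: the pseudo-data (one toxicity among five hypothetical patients at level 2, encoding a prior guess of 0.2 at that level), the constant weight `w`, the power form $\psi = \alpha_i^{b}$ with $b = \exp(a)$, and the use of a ternary search, which relies on the objective being concave in $b$. The observed data are the first nine outcomes of the illustration in Section 3.3.

```python
import math

# Working-model constants (values from Section 3.3).
ALPHAS = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]


def log_lik(b, levels, ys):
    """Log-likelihood of 0/1 outcomes under psi = alpha ** b."""
    total = 0.0
    for lev, y in zip(levels, ys):
        log_p = b * math.log(ALPHAS[lev])
        total += log_p if y == 1 else math.log1p(-math.exp(log_p))
    return total


def posterior_mode(levels, ys, pseudo_levels, pseudo_ys, w, lo=1e-6, hi=20.0):
    """Maximize w * log g(b) + (1 - w) * L_j(b), the exponent in (7),
    where log g is the log-likelihood of the pseudo-data as in (6)."""

    def objective(b):
        return w * log_lik(b, pseudo_levels, pseudo_ys) + (1.0 - w) * log_lik(
            b, levels, ys
        )

    for _ in range(200):  # ternary search on a concave function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# Hypothetical pseudo-data: 1 toxicity in 5 pseudo-patients at level 2.
pseudo_levels = [1, 1, 1, 1, 1]
pseudo_ys = [1, 0, 0, 0, 0]

# Observed data: first nine patients of Section 3.3.
levels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 0]

for w in (1.0, 0.2, 0.0):
    mode = posterior_mode(levels, ys, pseudo_levels, pseudo_ys, w)
    print("w =", w, "posterior mode:", round(mode, 3))
```

With `w = 1.0` the mode reproduces the prior guess carried by the pseudo-data, with `w = 0.0` it is the maximum likelihood estimate from the observed data, and intermediate weights fall between the two.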

Uninformative priors. For the model (2), O'Quigley
(1992) suggested a normal prior having mean zero
and variance $\sigma^2$, large enough to be considered
noninformative. Such a concept can be made more precise
in the following way, at least for fixed sample
designs. The mean and mode of the prior are at zero
so that, should the true probabilities of toxicity
exactly coincide with the $\alpha_i$, then the more informative
the prior the better we do; ultimately, as the
prior tends to being degenerate, that is, $\sigma^2 \to 0$, we
obtain the correct level always. Taking some distance
measure between the distribution of our final
recommendation and the degenerate distribution
putting all mass on the correct level, this distance
will increase as our uncertainty, as measured by $\sigma^2$,
increases. The curve of this distance, as a function of
$\sigma^2$, will reach an asymptotic limit, further increases
in $\sigma^2$ having a vanishing influence on the error
distribution of the final recommendation. The smallest
finite value of $\sigma^2$ such that the operating characteristics
are sufficiently close to those obtained when $\sigma^2$
is infinite (in practice very large), corresponding to
a diffuse and even improper prior, will provide the
prior with the required behavior.

An uninformative prior, in the sense that it does 
not favor any particular level, can be constructed 
readily in the light of the results of O'Quigley (2006) 
which partition the interval $[A, B]$ for the parameter
$a$ into $k$ subintervals $S_i$ ($i = 1, \ldots, k$). If $a \in S_i$, then
dose level $d_i$ corresponds to the MTD. For $k$ dose
levels we simply associate the probability mass $1/k$
to each of the $k$ subsets $S_i$. Clearly this approach is
readily extended to the informative case by putting 
priors favoring some levels more than others, either 
on the basis of clinical information or simply out of 
a desire to influence the operating characteristics in 
some particular way. An example for the frequent 
case k = 6 would be to associate the prior 0.05 with 
level 1, and the values 0.19 with the other five levels. 
This would result in steering us away from level 1 
in favor of the other levels, unless the accumulating 
data begin to weigh against our conjecture that level 
1 is unlikely to be the right level. 

[Figure: "Trial History"; horizontal axis: Subject No; legend: level, toxicity.]

Fig. 1. A typical trial history using rapid early escalation; target is level 7. 



Data-based prior in two-stage designs. In order to
be able to maximize the log-likelihood on the interior
of the parameter space we require heterogeneity
among the responses, that is, at least one toxic
and one nontoxic response (Silvapulle, 1981). Otherwise
the likelihood is maximized on the boundary
of the parameter space and our estimates of $R(d_i)$
($i = 1, \ldots, k$) are trivially either zero or 1 or,
depending on the model we are working with, may not
even be defined. In the context of "pure likelihood"-based
designs O'Quigley and Shen (1996) argued
for two-stage designs whereby an initial escalation
scheme provided the required heterogeneity. The
experiment can be viewed as not being fully underway
until we have some heterogeneity in the responses.
This heterogeneity could arise in a variety of different
ways: use of a standard Up and Down approach, use of an
initial Bayesian CRM as outlined below, or use of
a design believed to be more appropriate by the
investigator. Once we have achieved heterogeneity, the
model kicks in and we continue as prescribed above
(estimation-allocation). We can also consider this
initial escalation as providing empirical data.
Conditional upon these data we then proceed to the
second stage. The data obtained at the end of the first
stage can be viewed as providing an empirical prior.
In this way, all the approaches can be grouped under
a Bayesian umbrella. The essential differences arise
through the different ways of specifying the prior.

Using empirical data to construct a prior as the 
first stage of a two-stage design can afford us a great 
deal of flexibility. The initial exploratory escalation 
stage is followed by a more refined homing in on the 
target. Such an idea was first proposed by Storer 
(1989) in the context of the more classical Up and 
Down schemes. His idea was to enable more rapid 
escalation in the early part of the trial where we 
may be quite far from a level at which treatment 
activity could be anticipated. Moller (1995) was the 
first to use this idea in the context of CRM designs. 
Her idea was to allow the first stage to be based 
on some variant of the usual Up and Down proce- 
dures. In the context of sequential likelihood esti- 
mation, the necessity of an initial stage was pointed 
out by O'Quigley and Shen (1996) since the likeli- 
hood equation fails to have a solution on the interior 
of the parameter space unless some heterogeneity in 
the responses has been observed. Their suggestion 
was to work with any initial scheme, Bayesian CRM 
or Up and Down, and, for any reasonable scheme, 
the operating characteristics appear relatively insen- 
sitive to this choice. 

Here we describe an example of a two-stage design 
that has been used in practice (see Figure 1). There 
were many dose levels and the first included patient 
was treated at a low level. As long as we observe very 
low-grade toxicities then we escalate quickly, includ- 
ing only a single patient at each level. As soon as we 
encounter more serious toxicities then escalation is 
slowed down. Ultimately we encounter dose-limiting 
toxicities at which time the second stage, based on 
fitting a CRM model, comes fully into play. This is 
done by integrating this information and that ob- 
tained on all the earlier non-dose-limiting toxicities 
to estimate the most appropriate dose level. Con- 
sider the following example which uses information 
on low-grade toxicities in the first stage in order to 
allow rapid initial escalation (see Table 1). Specifi- 
cally we define a grade severity variable S(i) to be 
the average toxicity severity observed at dose level i, 
that is, the sum of the severities at that level divided 
by the number of patients treated at that level. The 
rule is to escalate providing S(i) is less than 2. Fur- 
thermore, once we have included three patients at 
some level, then escalation to higher levels only oc- 
curs if each cohort of three patients does not ex- 
perience dose-limiting toxicity. This scheme means 
that, in practice, as long as we see only toxicities of
severities coded 0 or 1, then we escalate. The first
severity coded 2 necessitates a further inclusion at
this same level and anything other than a 0 severity
for this inclusion would require yet a further inclusion
and a non-dose-limiting toxicity before being
able to escalate. This design also has the advantage
that, should we be slowed down by a severe (severity
3), albeit non-dose-limiting, toxicity, we retain the
capability of picking up speed (in escalation) should 
subsequent toxicities be of low degree (0 or 1). This 
can be helpful in avoiding being handicapped by an 
outlier or an unanticipated and possibly not drug- 
related toxicity arising early in the study. Once a 
dose-limiting toxicity is encountered the initial es- 
calation stage is brought to a halt and the accumu- 
lated data taken as our empirical prior. 

Table 1

Toxicity "grades" (severities) for trial

Severity   Degree of toxicity
--------   ------------------------------------
0          No toxicity
1          Mild toxicity (non-dose-limiting)
2          Nonmild toxicity (non-dose-limiting)
3          Severe toxicity (non-dose-limiting)
4          Dose-limiting toxicity


3.3 An Illustration 

An example of a two-stage design involving 16
patients was given by O'Quigley and Shen (1996).
There were six levels in the study, maximum likelihood
was used, and the first entered patients were
treated at the lowest level. The design was two-stage.
The true toxic probabilities were $R(d_1) = 0.03$,
$R(d_2) = 0.22$, $R(d_3) = 0.45$, $R(d_4) = 0.6$, $R(d_5) = 0.8$
and $R(d_6) = 0.95$. The working model was that given
by (2) where $\alpha_1 = 0.04$, $\alpha_2 = 0.07$, $\alpha_3 = 0.20$,
$\alpha_4 = 0.35$, $\alpha_5 = 0.55$ and $\alpha_6 = 0.70$. The targeted
toxicity was given by $\theta = 0.2$, indicating that the best
level for the MTD is given by level 2 where the true
probability of toxicity is 0.22. A grouped design was
used until heterogeneity in toxic responses was observed,
patients being included, as for the classical
schemes, in groups of three. The first three patients
experienced no toxicity at level 1. Escalation then
took place to level 2 and the next three patients
treated at this level did not experience any toxicity
either. Subsequently two out of the three patients
treated at level 3 experienced toxicity. Given
this heterogeneity in the responses the maximum
likelihood estimator for $a$ now exists and, following
a few iterations, could be seen to be equal to
0.715. We then have that $\hat R(d_1) = 0.101$, $\hat R(d_2) = 0.149$,
$\hat R(d_3) = 0.316$, $\hat R(d_4) = 0.472$, $\hat R(d_5) = 0.652$
and $\hat R(d_6) = 0.775$. The 10th entered patient is then
treated at level 2 for which $\hat R(d_2) = 0.149$ since, from
the available estimates, this is the closest to the target
$\theta = 0.2$. The 10th included patient does not suffer
toxic effects and the new maximum likelihood
estimator becomes 0.759. Level 2 remains the level
with an estimated probability of toxicity closest to
the target. This same level is in fact recommended
to the remaining patients so that after 16 inclusions
the recommended MTD is level 2. The estimated
probability of toxicity at this level is 0.212 and a
90% confidence interval for this probability is estimated
as (0.07, 0.39).

4. LARGE-SAMPLE AND SMALL-SAMPLE 
PROPERTIES 

Extensive simulations (O'Quigley, Pepe and Fisher, 
1990; O'Quigley and Shen, 1996; O'Quigley, 1999; 
Iasonos et al., 2008), over wide choices of possi- 
ble true unknown dose-toxicity situations, show the 
method to behave in a mostly satisfactory way, rec- 
ommending the right level or close levels in a high 
percentage of situations and treating in the study 
itself a high percentage of included patients, again, 
at the right level or levels close by. Cheung (2005), 
O'Quigley (2006) and Lee and Cheung (2009) ob- 
tained theoretical results which not only provide 
some confidence in using the method but can also 
provide guidance in the choice and structure of work- 
ing models. Even though models are misspecified, in- 
ference is still based on an estimating equation taken 
from the derivative of the log-likelihood. Thus, Shen
and O'Quigley (1996) defined

$$I_n(a) = \frac{1}{n} \sum_{j=1}^{n} \Big[ y_j \frac{\psi'}{\psi}\{x_j, a\} + (1 - y_j) \frac{-\psi'}{1 - \psi}\{x_j, a\} \Big].$$

Some restrictions on $\psi$ are needed (O'Quigley, 2006).
In particular, there must exist constants $a_1, \ldots, a_k \in [A, B]$
such that $\psi(d_i, a_i) = R(d_i)$, the function $\psi$
satisfies $\psi(d_i, B) < \theta < \psi(d_i, A)$, and there is a unique
$a_0 \in (a_1, \ldots, a_k)$, $\psi(d_0, a_0) = R(d_0) = \theta_0$. In general,
$\theta_0$ will not be equal to $\theta$ but will be as close as we
can get given the available doses. We require the
estimating function to respect a standard condition of
estimating functions which is that

$$s(t, x, a) = t \frac{\psi'}{\psi}\{x, a\} + (1 - t) \frac{-\psi'}{1 - \psi}\{x, a\}$$

is continuous and strictly monotone in $a$. We define
$\tilde I_n(a) = n^{-1} \sum_{j=1}^{n} s\{R(x_j), x_j, a\}$.

It is not typically the case that $\psi(d_i, a_0) = R(d_i)$
for $i = 1, \ldots, k$. However, at least in the vicinity of
the MTD, this will be approximately true, an idea
that can be formalized (Shen and O'Quigley, 1996)
via the definition of the set

$$S(a_0) = \{a : |\psi(d_0, a) - \theta| < |\psi(d_i, a) - \theta| \ \text{for all } d_i \ne d_0\}. \tag{8}$$

Shen and O'Quigley (1996) showed that convergence
follows if, for $i = 1, \ldots, k$, $a_i \in S(a_0)$. O'Quigley
(2006) showed that, for each $1 \le i \le k - 1$, there
exists a unique constant $\kappa_i$ such that
$\theta - \psi(x_i, \kappa_i) = \psi(x_{i+1}, \kappa_i) - \theta > 0$.
The constants $\kappa_i$ naturally give rise to a partitioning of
the parameter space $[A, B]$. Letting $\kappa_0 = A$ and
$\kappa_k = B$, we can write the interval $[A, B]$ as a union of
nonoverlapping intervals whereby
$[A, B] = \bigcup_{i=1}^{k} [\kappa_{i-1}, \kappa_i)$. This partition is of
particular value in establishing prior distributions
which can translate immediately into priors for the
dose levels themselves. It is also of value in deriving
results concerning the coherence, stability and
convergence of the algorithm (Cheung and Chappell,
2002; O'Quigley, 2006).
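
The partition constants can be computed numerically once a working model is fixed. The sketch below is illustrative only and assumes the power form $\psi(d_i, b) = \alpha_i^{b}$ on $b > 0$, for which $\psi$ decreases in $b$ so that small values of the parameter correspond to low dose levels. Under that assumption $\kappa_i$ solves $\alpha_i^{\kappa} + \alpha_{i+1}^{\kappa} = 2\theta$. The function names and the bisection bounds are our own, and the $\alpha_i$ and $\theta$ are those of the illustration in Section 3.3.

```python
# Working-model constants and target (values from Section 3.3).
ALPHAS = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]
THETA = 0.20


def kappas(alphas, theta, lo=1e-9, hi=50.0):
    """Return kappa_1, ..., kappa_{k-1}: the parameter values at which two
    adjacent doses are equally far from theta, one below and one above."""
    out = []
    for a_low, a_high in zip(alphas, alphas[1:]):
        left, right = lo, hi
        for _ in range(200):  # bisection; the sum below decreases in the exponent
            mid = 0.5 * (left + right)
            if a_low ** mid + a_high ** mid > 2.0 * theta:
                left = mid
            else:
                right = mid
        out.append(0.5 * (left + right))
    return out


def level_for(b, ks):
    """1-based dose level whose interval [kappa_{i-1}, kappa_i) contains b."""
    return 1 + sum(1 for kap in ks if b >= kap)


ks = kappas(ALPHAS, THETA)
print("kappa:", [round(kap, 3) for kap in ks])
print("level for b = 0.715:", level_for(0.715, ks))
```

Because each interval maps to exactly one dose level, a prior over levels (for example, equal mass $1/k$ on each) can be specified simply by assigning mass to the corresponding intervals.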



Convergence to the MTD stems from the fact that
$\sup_{a \in [A,B]} |I_n(a) - \tilde I_n(a)|$ converges almost surely to
zero (Shen and O'Quigley, 1996) and that we can
reexpress $\tilde I_n(a)$ as a sum over the $k$ dose levels rather
than a sum over the $n$ subjects; in particular we
have that $\tilde I_n(a) = \sum_{i=1}^{k} \pi_n(d_i)\, s\{R(d_i),d_i,a\}$.
Supposing that the solution to the equation $\tilde I_n(a) = 0$ is
$\tilde a_n$ and that $a_i$ is the unique solution to the equation
$s\{R(d_i),d_i,a\} = 0$, then $\tilde a_n$ will fall into the
interval $S_1(a_0)$. Since $\hat a_n$ solves $I_n(a) = 0$, then,
almost surely, $\hat a_n \in S(a_0)$, so that, for $n$ sufficiently
large, $x_{n+1} = d_0$. Since there are only a finite number
of dose levels, $x_n$ will ultimately settle at $d_0$.
Rather than appeal to the set S(ao), which quan- 
tifies the roughness of the working approximation 
to the true dose-toxicity function in the vicinity of 
the MTD, and which guarantees convergence to the 
MTD when all of the $a_i$ belong to this set, Cheung
(2005) used a related approach which appeals
to a more flexible — in many ways more realistic — 
definition of the MTD whereby probabilities of tox- 
icity within some given range are all taken to be 
acceptable. Convergence can then be shown to ob- 
tain without such restrictive conditions as those de- 
scribed above. 
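
To make the estimating equation concrete, the following sketch solves $I_n(a) = 0$ for a small set of observations and then picks the next level. It is a minimal illustration under stated assumptions: the working model is taken to be the power model $\psi(d_i,a) = \alpha_i^{\exp(a)}$, for which the score simplifies to $s(t,x,a) = e^{a}\log\alpha\,(t-\psi)/(1-\psi)$; the skeleton, the data, the target `theta` and the bracket $[-5, 5]$ are all hypothetical.

```python
import math

def score(t, alpha, a):
    """s(t, x, a) for the assumed power model psi(x, a) = alpha ** exp(a)."""
    p = alpha ** math.exp(a)
    # psi'/psi = exp(a) * log(alpha); both terms of s combine into this form.
    return math.exp(a) * math.log(alpha) * (t - p) / (1.0 - p)

def estimating_function(a, levels, ys, skeleton):
    """I_n(a): average score over the n observed (level, outcome) pairs."""
    return sum(score(y, skeleton[i], a) for i, y in zip(levels, ys)) / len(ys)

def solve(levels, ys, skeleton, lo=-5.0, hi=5.0, tol=1e-10):
    """Root of I_n(a) = 0 by bisection; I_n is strictly decreasing in a."""
    f_lo = estimating_function(lo, levels, ys, skeleton)
    f_hi = estimating_function(hi, levels, ys, skeleton)
    if not (f_lo > 0.0 > f_hi):
        raise ValueError("no root in bracket; need toxic and nontoxic outcomes")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if estimating_function(mid, levels, ys, skeleton) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def next_level(a_hat, skeleton, theta):
    """Zero-based index of the level with working probability closest to theta."""
    return min(range(len(skeleton)),
               key=lambda i: abs(skeleton[i] ** math.exp(a_hat) - theta))

skeleton = [0.05, 0.10, 0.20, 0.35, 0.50]  # hypothetical alpha_i
levels = [0, 1, 2, 2, 3, 3]                # hypothetical levels given so far
ys = [0, 0, 0, 1, 0, 1]                    # hypothetical toxicity indicators
a_hat = solve(levels, ys, skeleton)
print(a_hat, next_level(a_hat, skeleton, theta=0.20))
```

The `ValueError` branch reflects the fact that the equation has no finite root until both a toxicity and a nontoxicity have been observed.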

4.1 Efficiency 

O'Quigley (1992) proposed using $\hat\theta_n = \psi(x_{n+1},\hat a_n)$
to estimate the probability of toxicity at the
recommended level $x_{n+1}$, where $\hat a_n$ is the maximum
likelihood estimate. An application of the $\delta$-method
(Shen and O'Quigley, 1996) shows that the asymptotic
distribution of $\sqrt{n}\{\hat\theta_n - R(d_0)\}$ is
$N\{0, \theta_0(1-\theta_0)\}$. The estimate then provided by CRM is fully
efficient for large samples. This is what our intu- 
ition would suggest given the convergence proper- 
ties of CRM. What actually takes place in finite 
samples needs to be investigated on a case by case 
basis. The relatively broad range of cases studied 
by O'Quigley (1992) show a mean squared error for 
the estimated probability of toxicity at the recom- 
mended level under CRM to correspond well with 
the theoretical variance for samples of size n, were 
all subjects to be experimented at the correct level. 
Some of the cases studied showed evidence of super- 
efficiency, translating nonnegligible bias that hap- 
pens to be in the right direction, while a few others 
indicated efficiency losses large enough to suggest 
the possibility of better performance. 

A useful tool in studies of finite sample efficiency is 
the idea of an optimal design. We can derive a non- 
parametric optimal design (O'Quigley, Paoletti and 
Maccario, 2002) based upon no more than a monotonicity
assumption. Such an optimal design is not
generally available in practice but can serve as a gold 
standard in theoretical studies, playing a role similar 
to that of the Cramer-Rao bound. Comparisons be- 
tween any suggested method and the optimal design 
enable us to quantify just how much room there is 
for potential improvement. Further evidence of the 
efficiency of the CRM was provided by the work of 
O'Quigley, Paoletti and Maccario (2002), where the 
CRM is compared to the nonparametric optimal de- 
sign. In the cases studied in that article and in that 
of Paoletti, O'Quigley and Maccario (2004), poten- 
tial for improvement is seen to be limited, with the 
identification of the MTD by the two-stage CRM 
design being only slightly inferior to that of the op- 
timal design. 
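
A rough sketch of how such a benchmark might be simulated is given below. The text above does not spell out the construction, so the code rests on an assumption: the complete-information device usually associated with this benchmark, in which each simulated patient carries a latent tolerance $u \sim U(0,1)$ and would experience toxicity at every level $d_i$ with $R(d_i) \ge u$. The true curve `true_probs`, the target and the sample sizes are hypothetical.

```python
import random

def optimal_benchmark(true_probs, theta, n, rng):
    """Select a level using complete toxicity profiles for n simulated patients."""
    k = len(true_probs)
    tox_counts = [0] * k
    for _ in range(n):
        u = rng.random()                   # latent tolerance of this patient
        for i, r in enumerate(true_probs):
            if u <= r:                     # toxicity would occur at level i
                tox_counts[i] += 1
    est = [c / n for c in tox_counts]      # monotone by construction
    return min(range(k), key=lambda i: abs(est[i] - theta))

def percent_correct(true_probs, theta, n, n_trials, seed=0):
    """Proportion of simulated trials in which the benchmark picks the true MTD."""
    rng = random.Random(seed)
    mtd = min(range(len(true_probs)), key=lambda i: abs(true_probs[i] - theta))
    hits = sum(optimal_benchmark(true_probs, theta, n, rng) == mtd
               for _ in range(n_trials))
    return hits / n_trials

true_probs = [0.05, 0.10, 0.20, 0.35, 0.50]  # hypothetical R(d_1), ..., R(d_k)
print(percent_correct(true_probs, theta=0.20, n=25, n_trials=500))
```

The benchmark cannot be run in a real trial, since each patient is observed at one level only, but its percentage of correct selection gives an upper reference against which a practical design can be compared.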

4.2 Nonidentifiability of Fully Parameterized 
Models 

Under the conditions outlined above we will ultimately
only include patients at dose level $d_0$. Under
very much broader conditions (Shen and O'Quigley,
1996) we can guarantee convergence to some level,
not necessarily $d_0$ but one where the probability of
toxicity will not be far removed from that at $d_0$.
The consequence of this is that, for the most
common case of a single homogeneous group of patients,
we are obliged to work with an underparameterized 
model, notably a one-parameter model in the case 
of a single group. Although a two-parameter model 
may appear more flexible, the convergence property 
of CRM means that ultimately we will not obtain 
information needed to fit two parameters. Having 
settled at dose level $d_i$, the only quantity we can
estimate is $R(d_i)$ which can be done consistently
in light of the Glivenko-Cantelli lemma. Under our
model conditions we have that $R(d_i) = \psi(d_i,a_i)$ and
that $\hat a_j$ will converge almost surely to $a_i$. Adding
a second parameter can only overparameterize the
situation and, for example, the commonly used logistic
model has an infinite number of combinations of
the two parameters which lead to the same value of
$R(d_i)$. A likelihood procedure can then be unstable
and may even break down, whereas a two-parameter 
fully Bayesian approach (Gatsonis and Greenhouse, 
1992; Whitehead and Williamson, 1998) may work 
initially, although somewhat artificially, but behave 
erratically as sample size increases and the struc- 
tural rigidity provided by the prior gradually wanes. 
This is true even when starting out at a low or the
lowest level, initially working with an Up and Down
design for early escalation, before a CRM model is 
applied. Indeed, any design that ultimately concen- 
trates all patients from a single group on some given 
level can fit no more than a single parameter with- 
out running into problems of identifiability. 
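
The nonidentifiability is easy to see numerically. In the sketch below all patients are assumed to have been treated at a single level; the coded dose value, the counts and the slopes tried are hypothetical, and the two-parameter logistic form $\text{logit}\,\psi = a\,d + b$ is used purely for illustration. Every pair $(a,b)$ on the line $a\,d + b = \text{logit}(\hat R)$ gives exactly the same likelihood.

```python
import math

def loglik(a, b, dose, n_tox, n):
    """Binomial log-likelihood of a two-parameter logistic model at one dose."""
    eta = a * dose + b
    p = 1.0 / (1.0 + math.exp(-eta))
    return n_tox * math.log(p) + (n - n_tox) * math.log(1.0 - p)

dose, n, n_tox = 2.0, 30, 6      # hypothetical: 6 toxicities in 30 patients
rate = n_tox / n
target_logit = math.log(rate / (1.0 - rate))

# Very different slopes, each paired with the intercept that reproduces the
# observed rate at the single level where data exist.
ridge = [(a, target_logit - a * dose) for a in (0.1, 1.0, 5.0)]
values = [loglik(a, b, dose, n_tox, n) for a, b in ridge]
print(values)                    # the same value three times
```

The likelihood surface therefore has a flat ridge rather than a single maximum, and only a prior can single out one point on it.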

5. EXTENDED CRM DESIGNS 

The simple model of (2) can be extended to a class
of models denoted by $\psi_m(x_j,a)$ for $m = 1,\ldots,M$
where there are $M$ members of the class. Take, for
example,

$$\psi_m(d_i,a) = \alpha_{mi}^{\exp(a)}, \qquad i = 1,\ldots,k;\ m = 1,\ldots,M, \tag{9}$$

where $0 < \alpha_{m1} < \cdots < \alpha_{mk} < 1$ and $-\infty < a < \infty$,
as an immediate generalization of (2). Prior information
concerning the plausibility of each model is
catered for by $\pi(m)$, $m = 1,\ldots,M$, where $\pi(m) \ge 0$
and where $\sum_m \pi(m) = 1$. When each model is
given the same initial weighting, then we have that
$\pi(m) = 1/M$. If the data are to be analyzed under
model $m$, then, after the inclusion of $j$ patients, the
logarithm of the likelihood can be written as

$$\mathcal{L}_{mj}(a) = \sum_{\ell=1}^{j} y_\ell \log \psi_m(x_\ell,a) + \sum_{\ell=1}^{j} (1-y_\ell)\log\{1-\psi_m(x_\ell,a)\}, \tag{10}$$

where any terms not involving the parameter $a$ have
been ignored. Under model $m$ we obtain a summary
value of the parameter $a$, in particular the maximum
of the posterior density (the posterior mode), and we
refer to this as $\hat a_{mj}$.
Given the value of $\hat a_{mj}$ under model $m$, we have an
estimate of the probability of toxicity at each dose
level $d_i$ via $\hat R(d_i) = \psi_m(d_i,\hat a_{mj})$ $(i = 1,\ldots,k)$. On
the basis of this formula, and having taken some
value for $m$, the dose to be given to the $(j+1)$th
patient, $x_{j+1}$, is determined. Thus, we need some
value for $m$ and we make use of the posterior
probabilities of the models given the data $\Omega_j$. Denoting
these posterior probabilities by $\pi(m|\Omega_j)$, then

$$\pi(m|\Omega_j) = \frac{\pi(m)\int_{-\infty}^{\infty}\exp\{\mathcal{L}_{mj}(u)\}g(u)\,du}{\sum_{m'=1}^{M}\pi(m')\int_{-\infty}^{\infty}\exp\{\mathcal{L}_{m'j}(u)\}g(u)\,du}. \tag{11}$$

The estimated values of $\pi(m|\Omega_j)$ can help us decide
between models which have physical implications of 
interest to us. As an example suppose that there 
exists significant heterogeneity among the patients 
and this is simplified to the case of two groups. 
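
The posterior model probabilities in (11) can be approximated by one-dimensional numerical integration. The sketch below is an illustration only: it assumes the power models of (9), takes the prior density $g$ to be standard normal (an assumption, since the choice of $g$ is not fixed here), truncates the integral to $[-5, 5]$ and uses the trapezoid rule. The two skeletons and the data are hypothetical.

```python
import math

def log_lik(a, skeleton, levels, ys):
    """Log-likelihood (10) under the assumed model psi = alpha ** exp(a)."""
    e = math.exp(a)
    total = 0.0
    for i, y in zip(levels, ys):
        log_p = e * math.log(skeleton[i])          # log psi, always < 0
        total += log_p if y else math.log1p(-math.exp(log_p))
    return total

def normal_pdf(u):
    """Assumed prior density g: standard normal."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def marginal(skeleton, levels, ys, lo=-5.0, hi=5.0, steps=2000):
    """Trapezoid approximation to the integral of exp{L(u)} g(u) du."""
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps + 1):
        u = lo + s * h
        w = 0.5 if s in (0, steps) else 1.0
        total += w * math.exp(log_lik(u, skeleton, levels, ys)) * normal_pdf(u)
    return total * h

def model_posteriors(skeletons, prior, levels, ys):
    """pi(m | data) for each candidate model, following (11)."""
    weighted = [p * marginal(sk, levels, ys) for p, sk in zip(prior, skeletons)]
    z = sum(weighted)
    return [w / z for w in weighted]

skeletons = [[0.05, 0.10, 0.20, 0.35, 0.50],   # hypothetical model 1
             [0.10, 0.30, 0.50, 0.60, 0.70]]   # hypothetical model 2
levels = [0, 1, 2, 2, 3, 3]
ys = [0, 0, 0, 1, 0, 1]
print(model_posteriors(skeletons, [0.5, 0.5], levels, ys))
```

With no data the posterior probabilities reduce to the prior weights, which provides a simple check on the implementation.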

5.1 A Simple Heterogeneity Model 

As in other types of clinical trials we are essen- 
tially looking for an average effect. Patients natu- 
rally differ in the way they may react to a treat- 
ment and, although hampered by small samples, we 
may sometimes be in a position to specifically ad- 
dress the issue of patient heterogeneity. One exam- 
ple occurs in patients with acute leukemia where 
it has been observed that children will better tol- 
erate more aggressive doses (standardized by their 
weight) than adults. Likewise, heavily pretreated pa- 
tients are more likely to suffer from toxic side ef- 
fects than lightly pretreated patients. In such situ- 
ations we may wish to carry out separate trials for 
the different groups in order to identify the appro- 
priate MTD for each group. Otherwise we run the 
risk of recommending an "average" compromise dose 
level, too toxic for a part of the population and sub- 
optimal for the other. Usually, clinicians carry out 
two separate trials or split a trial into two arms af- 
ter encountering the first DLTs when it is believed 
that there are two distinct prognostic groups. This 
has the disadvantage of failing to utilize information 
common to both groups. The most common situa- 
tion is that of two samples where we aim to carry 
out a single trial keeping in mind potential differ- 
ences between the two groups. A multisample CRM 
is a direct generalization although we must remain 
realistic in terms of what is achievable in the light 
of the available sample sizes. 

Introduce a binary variable $Z$ taking the value 0
for the first group and 1 for the second group. Suppose
also that we know that, for the second group,
the probability of toxicity can only be the same or
higher than the first group. For this situation
consider the following models:

1. Model 1: $m = 1$

$\Pr(Y = 1|d_i, z = 0) = \psi(d_i,a)$, $i = 1,\ldots,k$,
$\Pr(Y = 1|d_i, z = 1) = \psi(d_i,a)$, $i = 1,\ldots,k$,

2. Model 2: $m = 2$

$\Pr(Y = 1|d_i, z = 0) = \psi(d_i,a)$, $i = 1,\ldots,k$,
$\Pr(Y = 1|d_i, z = 1) = \psi(d_{i+1},a)$, $i = 1,\ldots,k-1$,
$\Pr(Y = 1|d_k, z = 1) = \psi(d_k,a)$, $i = k$.

If the most plausible model has m = 1, then we 
conclude that there is no difference between the groups. 
If m = 2, then we conclude that for the second group 
the probability of toxicity at any level is the same as 
that for a subject from the first group but treated 
at one level higher. The truth will be more subtle 
but since we have to treat at some level we force 
this decision to be made at the modeling stage. The 
idea extends, of course, to several levels, positive as 
well as negative directions to the difference, and to 
other factors such as treatment schedules. 
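
The two models amount to a simple index shift, as the following sketch shows. It assumes the power working model and a hypothetical skeleton and target; `model`, `z` and the zero-based level index `i` are illustrative names.

```python
import math

def tox_prob(model, i, z, a, skeleton):
    """Working probability of toxicity at zero-based level i for group z."""
    k = len(skeleton)
    if model == 2 and z == 1:
        # Second group behaves like the first group one level higher,
        # except at the top level, where no higher level exists.
        i = min(i + 1, k - 1)
    return skeleton[i] ** math.exp(a)

def group_mtd(model, z, a, skeleton, theta):
    """Zero-based level whose working probability is closest to theta."""
    return min(range(len(skeleton)),
               key=lambda i: abs(tox_prob(model, i, z, a, skeleton) - theta))

skeleton = [0.05, 0.10, 0.20, 0.35, 0.50]  # hypothetical alpha_i
theta, a = 0.20, 0.0                       # hypothetical target and parameter
print(group_mtd(1, 1, a, skeleton, theta))  # model 1: same level as group 0
print(group_mtd(2, 0, a, skeleton, theta))  # model 2, first group
print(group_mtd(2, 1, a, skeleton, theta))  # model 2, second group: one lower
```

Under Model 2 the recommended level for the second group sits one level below that for the first group, which is the decision the modeling stage is being asked to force.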

5.2 Randomization and Two-Parameter Models 

Suppose that $j$ subjects are already entered in the
trial. Instead of systematically selecting the level
estimated as being closest to the target, a different
approach would be to use the available knowledge
to randomly select a level from $d_1,\ldots,d_k$ according
to some given discrete distribution. This distribution
does not have to be fixed in advance but can
depend on the available levels and the current
estimate of the MTD. Let $x_{j+1}$ be defined as before.
However, we will no longer allocate systematically
subject $j+1$ to dose level $x_{j+1}$ as before. Instead
we allocate to $W_{j+1}$ where we define

$$W_{j+1} = \begin{cases} \sum_{m=1}^{k} d_{m+\Delta}\, I\{x_{j+1} = d_m, m < k\}, & \hat R(x_{j+1}) < \theta, \\ \sum_{m=1}^{k} d_{m-\Delta}\, I\{x_{j+1} = d_m, m > 1\}, & \hat R(x_{j+1}) > \theta, \end{cases}$$

and where $\Delta$ is a Bernoulli (0, 1) random variable
with parameter typically of value 0.5. In words,
instead of allocating to the level closest to $\hat R(x_{j+1})$ we
allocate, on the basis of a random mechanism, to
the level just above $\hat R(x_{j+1})$ or the level just below
$\hat R(x_{j+1})$. In the cases where $\hat R(x_{j+1})$ is lower than
the lowest available level, or higher than the highest 
available level, then the allocation becomes, again, 
systematic. The purpose of the design is then to be 
able to sample on either side of the target. Aside 
from those cases in which the lowest level appears 
to be more toxic than the target or the highest level 
less toxic than the target, observations will tend to 
be concentrated on two levels. One of these levels 
will have an associated estimated probability below
the target while the other level will have an esti- 
mated probability above the target. 
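
A minimal sketch of this randomized allocation follows. The function and argument names are hypothetical, levels are indexed from zero, and the Bernoulli parameter is exposed as `p_move` with the typical value 0.5. When the estimated probability at the current level is below the target the patient is sent to that level or the next one up; when it is above the target, to that level or the next one down; at the boundaries the allocation stays systematic.

```python
import random

def randomized_level(current, r_hat, theta, k, rng, p_move=0.5):
    """Randomized allocation around the target.

    current : zero-based index of the level x_{j+1} chosen by the usual rule
    r_hat   : estimated probability of toxicity at that level
    k       : number of available levels
    """
    delta = 1 if rng.random() < p_move else 0    # Bernoulli (0, 1) variable
    if r_hat < theta and current < k - 1:
        return current + delta                   # current level or one above
    if r_hat > theta and current > 0:
        return current - delta                   # current level or one below
    return current                               # boundary: systematic

rng = random.Random(3)
draws = [randomized_level(2, 0.17, 0.20, 5, rng) for _ in range(10)]
print(draws)   # a mixture of levels 2 and 3
```

Over many patients the allocations concentrate on two adjacent levels, one on either side of the target.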

An immediate consequence of forcing experimen- 
tation to take place at more than a single level is 
that the nonidentifiability described above changes. 
It is now possible to estimate more than a single pa- 
rameter, for example the rate of toxicity at, say, the 
lower of the two levels as well as the rate of toxicity 
at the next level up. Working with a one-parameter
model and randomizing to two levels, say $d_\ell$ and
$d_{\ell+1}$, the estimate $\hat a$ will converge to the solution of
the equation

$$\pi(d_\ell)\left[R_\ell\frac{\psi'}{\psi}(d_\ell,a) + (1-R_\ell)\frac{-\psi'}{1-\psi}(d_\ell,a)\right] + \{1-\pi(d_\ell)\}\left[R_{\ell+1}\frac{\psi'}{\psi}(d_{\ell+1},a) + (1-R_{\ell+1})\frac{-\psi'}{1-\psi}(d_{\ell+1},a)\right] = 0,$$

where $\pi(d_\ell)$ is the stable distribution (long-term
proportion) of patients included at level $d_\ell$. Comparing
this equation with the estimating equation for the 
standard case without randomization, we can see 
that, unless the working model generates the obser- 
vations, we will not obtain consistent estimates of 
the probabilities of toxicities at the two doses of the 
stable distribution. However, introducing a second 
parameter into the model, one which describes the 
differences between the probabilities of toxicity at 
the two dose levels, we obtain consistent estimates 
at these two doses of the stable distribution. To see 
this it is enough to parameterize the probability of 
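toxicity at the current level $d_\ell$ as $\psi(d_\ell,a)$ and that
at level $d_{\ell+1}$ by $\psi(d_{\ell+1},a+b)$. The estimates will
converge to the solution of

$$\pi(d_\ell)\left[R_\ell\frac{\psi'}{\psi}(d_\ell,a) + (1-R_\ell)\frac{-\psi'}{1-\psi}(d_\ell,a)\right] + \{1-\pi(d_\ell)\}\left[R_{\ell+1}\frac{\psi'}{\psi}(d_{\ell+1},a+b) + (1-R_{\ell+1})\frac{-\psi'}{1-\psi}(d_{\ell+1},a+b)\right] = 0,$$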

for which each term separately can be then accommodated
within the framework describing consistency
given above. In practice we would use a model
such as the logistic where

$$\psi(d_k,a,b) = \frac{\exp(a\alpha_k + b)}{1 + \exp(a\alpha_k + b)},$$

which, once settling takes place, is then a saturated
model.
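
The contrast between the one-parameter and two-parameter limits can be checked numerically. The sketch below is illustrative and assumes the power working model $\psi(d,a) = \alpha^{\exp(a)}$ rather than the logistic; the true rates, skeleton values and allocation proportion are hypothetical. It solves the limiting one-parameter equation by bisection and compares the fitted probabilities with those obtained when each level has its own parameter value.

```python
import math

def psi(alpha, a):
    """Assumed power working model."""
    return alpha ** math.exp(a)

def score(t, alpha, a):
    """s(t, x, a) for the power model, with t a true toxicity rate."""
    p = psi(alpha, a)
    return math.exp(a) * math.log(alpha) * (t - p) / (1.0 - p)

def one_parameter_limit(pi_l, r_l, r_u, alpha_l, alpha_u,
                        lo=-5.0, hi=5.0, tol=1e-12):
    """Root of the limiting estimating equation with a single parameter a."""
    def f(a):
        return pi_l * score(r_l, alpha_l, a) + (1.0 - pi_l) * score(r_u, alpha_u, a)
    while hi - lo > tol:                 # f is strictly decreasing in a
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_parameter(r, alpha):
    """Value of a with psi(alpha, a) = r, that is exp(a) = log(r) / log(alpha)."""
    return math.log(math.log(r) / math.log(alpha))

alpha_l, alpha_u = 0.20, 0.35   # hypothetical skeleton values at the two levels
r_l, r_u = 0.15, 0.40           # hypothetical true toxicity rates
a_star = one_parameter_limit(0.5, r_l, r_u, alpha_l, alpha_u)
print(psi(alpha_l, a_star), psi(alpha_u, a_star))   # neither rate recovered

a_two = exact_parameter(r_l, alpha_l)               # solves the first term
b_two = exact_parameter(r_u, alpha_u) - a_two       # a + b solves the second
print(psi(alpha_l, a_two), psi(alpha_u, a_two + b_two))
```

With a single parameter the limit is a compromise between the two levels; with the second parameter each term of the equation vanishes separately and both rates are reproduced.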

6. RELATED DESIGNS 

There have been many suggestions in the litera- 
ture for possible modifications of the basic design. 
Also, some apparently alternative designs turn out 
to be equivalent to the basic design. In this section 
we consider some of these designs. 

6.1 Escalation with Underdose/Overdose 
Control 

Babb, Rogatko and Zacks (1998) argued that the 
main ethical concern was not so much putting each 
patient at a dose estimated to be the closest to 
the MTD but rather putting each patient at a dose 
for which the probability of it being too great was 
minimized. The difference may be subtle but would 
be a basis for useful, and important, discussions 
with the clinicians involved. These discussions help 
make explicit the goals, both in terms of final rec- 
ommendation and for those patients included in the 
study. There may be situations where a parallel con- 
cern might focus on the underdosing rather than 
the overdosing. For an approach based on the CRM 
we would simply modify the definition of the dose 
level "closest to the target" to be asymmetric. Pos- 
itive distances could be magnified relative to nega- 
tive ones resulting in a tendency to assign below the 
MTD rather than above it. 
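
One way such an asymmetric rule might look is sketched below. The weight `overdose_weight` and the estimated probabilities are hypothetical; a weight of 1 recovers the usual symmetric choice of the level closest to the target.

```python
def asymmetric_choice(est_probs, theta, overdose_weight=2.0):
    """Pick the level minimizing a distance that magnifies positive deviations."""
    def dist(p):
        d = p - theta
        return overdose_weight * d if d > 0 else -d
    return min(range(len(est_probs)), key=lambda i: dist(est_probs[i]))

est = [0.05, 0.12, 0.26, 0.40]                # hypothetical estimated rates
print(asymmetric_choice(est, 0.20, 1.0))      # symmetric rule: level index 2
print(asymmetric_choice(est, 0.20, 2.0))      # overdose penalized: index 1
```

In this example the symmetric rule picks the level estimated at 0.26, while doubling the cost of exceeding the target moves the choice to the level estimated at 0.12.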

Babb, Rogatko and Zacks (1998) approached the 
problem differently by focusing on the posterior dis- 
tribution of the MTD and suggesting a loss function 
that penalizes overdosing to a greater degree than 
underdosing. Tighiouart, Rogatko and Babb (2005) 
developed the idea further, investigating a number 
of prior distributions. Despite this change in empha- 
sis, there is no fundamental difference between these 
approaches and the CRM, aside from the
use of a particular distance measure. The methods
of Babb, Rogatko and Zacks (1998) and Tighiouart, 
Rogatko and Babb (2005) allow for continuous dose 
levels. Although the CRM is most frequently ap- 
plied in cases with a fixed set of dose levels, it can 
be adapted to allocate patients on dose levels other 
than the fixed set of doses. 

6.2 ADEPT and Two-Parameter CRM

O'Quigley, Pepe and Fisher (1990) studied two- 
parameter CRM models based on the logistic dis- 
tribution. For large samples the parameters are not 
identifiable and we may expect that this could lead
to unstable or undesirable operating characteristics. 
For small to moderate finite samples the behavior 
can be studied on a case by case basis. Even when 
the two-parameter model correctly generated the 
observations, the simulations of O'Quigley, Pepe and 
Fisher indicated that the one-parameter CRM would 
work better for sample sizes up to around 25. 

Whitehead and Brunier (1995) suggested working 
with the two-parameter logistic model and using a 
pseudo-data prior. This has been put together as a 
software package and is called ADEPT. The term 
ADEPT is used to describe either the software itself 
or the approach which would be equivalent to a two- 
parameter CRM with a data-based prior. Gerke and 
Siedentop (2008) argued that ADEPT is to be pre- 
ferred to standard CRM in terms of accuracy of rec- 
ommendation. This conclusion was based on a study 
of three, rather particular, situations in which the 
target dose lies exactly at the midpoint between two 
of the available doses. They define the lower of these 
two doses as being the MTD. Gerke and Siedentop's 
definition of the MTD is not the usual one which, 
had it been used in their simulations, would have 
resulted in the very opposite conclusion. The usual 
one, and that used in O'Quigley, Pepe and Fisher, 
is the dose which is the closest to the target. Should 
two doses be equidistant from the target then, logi- 
cally, either one could be considered to be the MTD. 
This observation alone would completely reverse the 
findings of Gerke and Siedentop (Shu and O'Quigley, 
2008). 

The ADEPT program leans more formally on 
Bayesian decision procedures which, it is argued 
(Whitehead and Brunier, 1995), represent a general- 
ization of the CRM since, instead of basing sequen- 
tial patient allocation on the anticipated gain for the 
next included patient or group of patients, allocation 
could be based on the gain for the variance of esti- 
mators. In the case of more than one parameter we 
could use different combinations of the individual 
variances and covariances, in particular the deter- 
minant of the information matrix. Whitehead and 
Brunier argued that "gain functions can be devised 
from the point of view of the investigator (accuracy 
for future patients) or from the point of view of the 
next included patient, as in the CRM. Weighted av- 
erages of these two possibilities can be used to form 
compromise procedures." 

However, under current guidelines, it is not possible
to use a procedure which sacrifices the point
of view of the current patient to that of future
patients. It is only future patients who may benefit
from improved precision (the point of view of the 
investigator) and, although, in medical experimen- 
tation, arguments have been and will continue to be 
put in such a direction, such logic is not currently 
considered acceptable. Experimentation on an indi- 
vidual patient can only be justified if it can be ar- 
gued that the driving goal is the benefit of that same 
patient. Basing allocation on anything other than 
patient gain, and, in particular, the gain for future 
patients, would be a violation of the usual ethical 
criteria in force in this area. In practice, only patient 
gain is used, and so ADEPT is essentially the same 
as two-parameter CRM. In animal experimentation 
or in experimentation in healthy volunteers, where 
severe side effects are considered very unlikely, a case 
could be built for using other gain functions. 

6.3 Curve-Free Designs 

Rather than appeal to a working model $\psi(x,a)$
and have $a$ follow some distribution, we can employ
a multivariate distribution of dimension $k$ and consider
the ordered probabilities at the $k$ levels to be
the quantities of interest. Prior median or mean values
for the distribution of $R(d_i)$, the probability of
toxicity at dose $d_i$, are provided by the clinician.
We then work with a multivariate law that is flexi- 
ble enough to allow reasonable operating character- 
istics, escalating quickly enough in the absence of 
observed toxicities and not being unstable or overre- 
acting to toxicities that occur. Gasparini and Eisele 
(2000) argued in favor of experimenting this way. 
They suggested working with a product of beta priors
(PBP) upon reparameterizing whereby

$$\theta_1 = 1 - R(d_1), \qquad \theta_i = \frac{1 - R(d_i)}{1 - R(d_{i-1})}, \quad i = 2,\ldots,k,$$

and then letting the $\theta_i$ $(i = 1,\ldots,k)$ have independent
beta distributions. Since $R(d_i) = 1 - \theta_1\theta_2\cdots\theta_i$
the monotonicity constraint is respected. The distribution
of a product of beta distributions is complex
but the authors argue that we can approximate this
well by taking the product itself to be beta. We then
fit such a beta using the first two moments from the
product, easily achieved under the condition of
independence of the $\theta_i$. Gasparini and Eisele (2000)
provided some guidelines for setting up the prior for this
multivariate law based on consideration of operating 
characteristics. O'Quigley (2002b) demonstrated an
equivalence between a curve-free design and a CRM 
design in that, given a particular specification of a 
curve-free design, there exists an equivalent speci- 
fication of a CRM design. This is also true in the 
other direction. By equivalent we mean that all op- 
erational characteristics are the same. However, this 
still remains only an existence result and it is not yet 
known how to actually find the equivalent designs. 
Cheung (2002) noted that in cases where low toxic- 
ity percentiles are targeted, the use of the nonpara- 
metric approach with a vague prior can lead to dose 
allocation that tends to be confined to suboptimal 
levels. Cheung (2002) exploited the connection with 
the CRM to suggest informative priors that can help 
alleviate this problem. 
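
The moment-matching step for the product of independent beta variables can be sketched as follows. The beta parameters used in the example are hypothetical and the function names are illustrative. Independence gives the first two moments of the product as products of the individual moments, and a single beta distribution with the same mean and variance is then fitted.

```python
def beta_moments(a, b):
    """First and second raw moments of a Beta(a, b) variable."""
    m1 = a / (a + b)
    m2 = a * (a + 1.0) / ((a + b) * (a + b + 1.0))
    return m1, m2

def approx_product_beta(params):
    """Beta(A, B) matching the first two moments of a product of
    independent Beta(a_i, b_i) variables."""
    m1 = m2 = 1.0
    for a, b in params:
        e1, e2 = beta_moments(a, b)
        m1 *= e1            # E[prod] = prod of E under independence
        m2 *= e2            # E[prod^2] = prod of E[theta_i^2]
    var = m2 - m1 * m1
    common = m1 * (1.0 - m1) / var - 1.0
    return m1 * common, (1.0 - m1) * common

# Hypothetical priors for theta_1, theta_2, theta_3.
params = [(9.0, 1.0), (8.0, 2.0), (7.0, 3.0)]
A, B = approx_product_beta(params)
print(A, B)   # 1 - R(d_3) is approximately Beta(A, B)
```

Since $1 - R(d_i)$ is approximated by a Beta$(A,B)$ variable, $R(d_i)$ itself is approximately Beta$(B,A)$.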

Whitehead et al. (2010) suggested an approach 
in which the probabilities of toxicity at each dose, 
rather than belonging to some continuum, are only 
allowed to belong to a small discrete set. In practice, 
we do not need to distinguish a probability of tox- 
icity of 0.32 from a probability of 0.34. They could 
be considered the same, or, in some sense at least, 
equivalent. The idea is not unrelated to the idea 
of Cheung and Chappell (2002) on indifference intervals.
Performance of Whitehead and colleagues'
method is comparable to that of the CRM.

7. IDENTIFYING THE MOST SUCCESSFUL 
DOSE (MSD) 

In the context of dose finding in HIV, O'Quigley,
Hughes and Fenton (2001) considered the problem of
finding the dose which maximizes the overall
probability of success. Here, failure is either a toxicity
(in the HIV context, mostly an inability to maintain
treatment) or an unacceptably low therapeutic
response. Zohar and O'Quigley (2006a) made a
slight modification to the approach to better
accommodate the cancer setting. We take $Y$ and $V$ to be
binary random variables (0, 1) where $Y = 1$ denotes
a toxicity, $Y = 0$ a nontoxicity, $V = 1$ a response,
and $V = 0$ a nonresponse. As before, the probability
of toxicity at the dose level $X_j = x_j$ is defined by

$$R(x_j) = \Pr(Y_j = 1 | X_j = x_j).$$

The probability of response given no toxicity at dose
level $X_j = x_j$ is defined by

$$Q(x_j) = \Pr(V_j = 1 | X_j = x_j, Y_j = 0),$$

so that $P(d_i) = Q(d_i)\{1 - R(d_i)\}$ is the probability
of success. A successful trial would identify the dose
level $l$ such that $P(d_l) > P(d_i)$ (for all $i$ where $i \neq l$).
Zohar and O'Quigley (2006b) called this dose the 
most successful dose and our purpose in this kind of 
study is, rather than find the MTD, to find the MSD. 
The relationship between toxicity and dose ($x_j$) and
the relationship between response given no toxicity
and dose can be modeled through the use of two
one-parameter models. Whereas $R(d_i)$ and $Q(d_i)$ refer
to exact, usually unknown, probabilities, the
model-based equivalents of these, $\psi$ and $\phi$, respectively, are
only working approximations given by

$$R(d_i) \approx \psi(d_i,a) = \alpha_i^{\exp(a)}; \qquad Q(d_i) \approx \phi(d_i,b) = \beta_i^{\exp(b)},$$

where $0 < \alpha_1 < \cdots < \alpha_k < 1$, $-\infty < a < \infty$,
$0 < \beta_1 < \cdots < \beta_k < 1$ and $-\infty < b < \infty$. For each dose, there
exist unique values of $a$ and $b$ such that the
approximation becomes an equality at that dose, but not
necessarily exact at the other doses. After the
inclusion of $j$ patients, $R(d_i)$, $Q(d_i)$, and $P(d_i)$ are
estimated by

$$\hat R(d_i) = \psi(d_i,\hat a_j); \qquad \hat Q(d_i) = \phi(d_i,\hat b_j); \qquad \hat P(d_i) = \phi(d_i,\hat b_j)\{1 - \psi(d_i,\hat a_j)\},$$

where $\hat a_j$ and $\hat b_j$ maximize the log-likelihood (see
O'Quigley, Hughes and Fenton, 2001).
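
Given current estimates of $a$ and $b$, the estimated success probabilities and the estimated MSD follow directly, as in the sketch below. The skeletons `alphas` and `betas` and the parameter values are hypothetical, and the estimation of $\hat a_j$ and $\hat b_j$ themselves is not shown.

```python
import math

def estimated_msd(alphas, betas, a_hat, b_hat):
    """Return the zero-based index of the estimated MSD and all P-hat values."""
    p_success = []
    for alpha, beta in zip(alphas, betas):
        r_hat = alpha ** math.exp(a_hat)          # estimated toxicity rate
        q_hat = beta ** math.exp(b_hat)           # response given no toxicity
        p_success.append(q_hat * (1.0 - r_hat))   # P = Q (1 - R)
    best = max(range(len(p_success)), key=lambda i: p_success[i])
    return best, p_success

alphas = [0.05, 0.10, 0.20, 0.35, 0.50]   # hypothetical toxicity skeleton
betas = [0.20, 0.40, 0.60, 0.70, 0.75]    # hypothetical response skeleton
best, p_success = estimated_msd(alphas, betas, a_hat=0.0, b_hat=0.0)
print(best, [round(p, 3) for p in p_success])
```

In this example the success probability rises with dose while response gains dominate and then falls once toxicity takes over, so the MSD is an interior level.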

8. CONCLUSIONS 

More fully Bayesian approaches in a decision mak- 
ing context, and not simply making use of Bayesian 
estimators, have been suggested for use in the con- 
text of Phase I trial designs. These can be more in 
the Bayesian spirit of inference, in which we quan- 
tify prior information, observed from outside the 
trial as well as that solicited from clinicians and/or 
pharmacologists. Decisions are made more formally 
using tools from decision theory. Any prior informa- 
tion can subsequently be incorporated via the Bayes 
formula into a posterior density that also involves 
the actual current observations. Given the typically 
small sample sizes often used, a fully Bayesian ap- 
proach has some appeal in that we would not wish 
to waste any relevant information at hand. Unlike 
the set-up described by O'Quigley, Pepe and Fisher 
(1990), we could also work with informative priors. 

Gatsonis and Greenhouse (1992) considered two- 
parameter probit and logit models for dose response 
and studied the effect of different prior distributions. 
Whitehead and Williamson (1998) carried out similar
studies but with attention focusing on logistic
models and beta priors. Whitehead and Williamson 
(1998) worked with some of the more classical no- 
tions from optimal design for choosing the dose lev- 
els in a bid to establish whether much is lost by us- 
ing suboptimal designs. O'Quigley, Pepe and Fisher 
(1990) ruled out criteria based on optimal design due 
to the ethical criterion of the need to attempt to as- 
sign the sequentially included patients at the most 
appropriate level for the patient. This same point 
was also emphasized by Whitehead and Williamson 
(1998). Certain contexts, however, may allow the use 
of more formal optimal procedures. 

For certain problems we may have good knowl- 
edge about some aspect of the problem and poor 
knowledge on the others. The overall dose-toxicity 
curve may be very poorly known but, if this were 
to be given for, say, one group, then we would have 
quite strong knowledge of the dose-toxicity curve for 
another group. Uninformative Bayes or maximum 
likelihood would then seem appropriate overall al- 
though we would still like to use information that we 
have, an example being the case of a group weak- 
ened by extensive prior therapy and thereby very 
likely to have a level strictly less than that for the 
other group. Careful parameterization would enable 
this information to be included as a constraint. How- 
ever, rather than work with a rigid and unmodifiable 
constraint, a Bayesian approach would allow us to 
specify the anticipated direction with high proba- 
bility while enabling the accumulating data to over- 
ride this assumed direction if the two run into se- 
rious conflict. Exactly the same idea could be used 
in a case where we believe there may be group het- 
erogeneity but that it be very unlikely the correct 
MTDs differ by more than a single level. This is es- 
pecially likely to be of relevance in situations where 
a defining prognostic variable, say the amount of 
prior treatment, is not very sharp so that group 
classifications may be subject to some error. If the 
resulting MTDs do differ we would not expect the 
difference to be very great. Incorporating such information
into the design will improve efficiency.

Stochastic approximation, which is an algorithm 
for finding the root of an unknown regression equa- 
tion, can be shown, under certain conditions, to be 
equivalent to recursive inversion of a linear model 
(Wu, 1985, 1986; Cheung and Elkind, 2010). In the 
light of those results, the CRM, in its basic form, 
could then be viewed as stochastic approximation
leaning upon a particular dose-response model rather
than a linear one. However, this characterization of 
the methodology is less fundamental than two oth- 
ers: (1) use of an underparameterized model and (2) 
restriction of the available doses to a limited finite 
set. 
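
For readers less familiar with stochastic approximation, a minimal sketch of the classical Robbins-Monro recursion on a continuous dose scale is given below. It is not taken from the articles cited above: the logistic curve `true_tox`, the gain constant `c`, the starting dose and the number of steps are all hypothetical choices made for the simulation. The recursion moves the dose down after a toxicity and up after a nontoxicity, with steps shrinking as $1/n$.

```python
import math
import random

def true_tox(x):
    """Hypothetical continuous dose-toxicity curve, used for simulation only."""
    return 1.0 / (1.0 + math.exp(-(x - 2.0)))

def robbins_monro(theta, x0, c, n, rng):
    """x_{i+1} = x_i - (c / i) * (y_i - theta), y_i a binary toxicity outcome."""
    x = x0
    for i in range(1, n + 1):
        y = 1 if rng.random() < true_tox(x) else 0
        x = x - (c / i) * (y - theta)
    return x

theta = 0.20
root = 2.0 + math.log(theta / (1.0 - theta))   # dose where true_tox equals theta
x_final = robbins_monro(theta, x0=0.0, c=5.0, n=50000, rng=random.Random(7))
print(round(root, 3), round(x_final, 3))
```

The recursion requires doses on a continuum; once the doses are restricted to a finite set, the iterate can no longer be given exactly, which is the setting in which the underparameterized model becomes necessary.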

The second of the above characterizations implies 
the necessity for the first (see Section 4.2). Consis- 
tency of stochastic approximation fails in the setting 
where we have a limited set of available responses 
(doses) and can only be achieved under conditions 
analogous to those outlined in this article (Shen 
and O'Quigley, 2000). Other algorithms similar to 
stochastic approximation (adaptive designs) rely on 
probabilistic rules to identify some percentile (dose) 
from an unknown distribution. Wu's (1985, 1986) 
findings suggest that there is usually some implicit 
model behind the algorithm. 

The CRM makes implicit models explicit ones; 
underparameterized, and therefore misspecified, but 
sufficiently flexible to obtain accurate estimates lo- 
cally although not reliable at points removed from 
those at which the bulk of experimentation takes 
place. The model, being explicit, readily enables ex- 
tension and generalization. The two group case, in- 
corporation of randomization about the target or 
the inclusion of partial prior information are, at least 
conceptually, relatively straightforward tasks. The 
framework is then in place to investigate other as- 
pects of dose-finding designs such as multigrade out- 
comes or the ability to exploit information on within- 
subject escalation. As for any method, there is al- 
ways room for improvement, although the results on 
optimality suggest that, for the basic problem, this 
room is not great. It is likely to be more fruitful to 
focus our attention on more involved problems such 
as continuous outcomes, subject heterogeneity, com- 
bined efficacy-toxicity studies, and studies involving 
escalation of two or more components. 